In this study a barcode mouse was treated with tamoxifen to induce barcode generation in situ. Cells that had undergone barcode generation then express GFP. From this mouse LSK cells from the bone marrow were isolated by FACS and split into two distinct fractions GFP positive cells, and GFP negative cells.

Cells were sequenced on the 10X Genomics platform, a droplet based approach to isolate single-cells for sequencing. As described in AlJahani et al (2018): “Droplet-based single-cell gene expression approaches use microfluidic chips to isolate single cells along with single beads in oil-encapsulated droplets, using microfluidics to bring oil, beads, and cell suspensions together in such a way that each droplet contains at most a single cell.20 The beads are coated with DNA oligos that are composed of a poly(T) tail at the 3′ end for the capture of cellular mRNAs, and at the 5′ end both a cell barcode that is identical for every oligo coating an individual bead and a library of individual unique molecular identifier (UMI) barcodes of high diversity, each UMI different for every oligo on the bead. The transcripts from each individual cell captured and labeled by the DNA oligos attached to a bead within the droplets are reverse transcribed, amplified with PCR, and sequenced using a high-throughput platform, after breaking and pooling droplet contents. The resulting sequences are aligned to a reference genome in order to annotate each transcript with its gene name. The cell barcodes on the aligned sequences allow for the computational linking of each gene transcript to its cell of origin. The number of copies of individual gene transcripts expressed in each individual cell is tallied using the UMIs, allowing the assembly of digital gene expression matrices (DGEs), which are tables of cell barcodes and gene counts.”

Step 1: Prepare the workspace, loading the appropriate packages and reading the data in with associated metadata

#clear the workspace
rm(list=ls())
#set the working directory and load in required libraries
setwd("/Users/jasoncosgrove/Dropbox (Team Perié)/Jason/Experiments/JCB6_VDJMouse/Publication")
library(Seurat)
library(scran)
library(org.Mm.eg.db)
library(clustree)
library(destiny)
library(SingleCellExperiment)
library(scater)
library(scales)
library(dplyr)

source("helpermethods.R")

#set the seed for the analysis so that it is reproducible
set.seed(12345)

create a seurat objet

# This method takes the outputs of the cellRanger pipeline and converts the data into a seurat object.
# For more details see the helpermethods.R file
lsks <- generateSeuratObject()
## [1] "dataset contains 27998 genes"
## [1] "dataset contains 2873 cells"

Step 2: Data QC

To assess the quality of the data we assess the library sizes, numbers of genes expressed and mitochondrial content per cell. Cells with very low library sizes are typically because of poor capture quality pontentially due to cell death, premature rupture, or capture of random mRNA escaping from cells, consequently cells with low library sizes are also filtered out from downstream analyses.

Another important QC metric is mitochondrial content. As discussed in AlJanahi et al (2018) “High numbers of mitochondrial transcripts are indicators of cell stress, and therefore cells with elevated mitochondrial gene expression are often not included in the analysis, because most experiments will not benefit from clustering cells based on stress levels. However, just as with number of transcripts, this parameter is highly dependent on the tissue type and the questions being investigated.”

#find the percentage of mitochondrial genes
lsks<- Seurat::PercentageFeatureSet(object = lsks, pattern = "^mt-", col.name = "percent.mt")

#plot key quality control metrics
VlnPlot(lsks, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)

# Filter outlier cells based on visual inspection of the above plot
# we shouldnt have a lower bound for mitochondrial because we know that LT-HSCs have low
# mitochondrial content relative to other cells. 
lsks <- subset(lsks, subset = nFeature_RNA > 1500
               & percent.mt < 10 & percent.mt > 1)


print(paste("dataset contains" , nrow(lsks),"genes after filtering"), sep = " " )
## [1] "dataset contains 13183 genes after filtering"
print(paste("dataset contains" , ncol(lsks),"cells after filtering"), sep = " " )
## [1] "dataset contains 2807 cells after filtering"

Step 2: Data normalisation

When analyzing sequencing data, normalization to eliminate batch effects is crucial if multiple sequencing runs and between 10X runs are to be compared with each other. These batch effects can be caused by often unavoidable technical variations such as the duration samples were kept on ice, number of freeze-thaw cycles, method of RNA isolation, sequencing depth, etc.

An additional consideration is that droplet-based sequencing in addition consists of thousands of individual cell experiments, hence cell-specific biases must also be considered when normalizing, in order to be able to compare the expression of one cell to another. A notable cell-specific bias is caused by mRNA capture efficiency, where the mRNA molecules are not captured by the bead at the same proportion in all droplets. As individual cells are not all of the same type a key consideration is how to retain cell to cell variability while eliminating technical noise.

To normalise our data we use sctransform. Briefly, in this approach the normalized values are Pearson residuals from regularized negative binomial regression, where cellular sequencing depth is used as a covariate. To scale the data, we center it around zero for each gene. The scaled data is used just for generating visualizations of the data, while the normalized matrix is used for statistical comparisons such as differential expression. In their preprint they show that an unconstrained negative binomial model may overfit scRNA-seq data, and overcome this by pooling information across genes with similar abundances to obtain stable parameter estimates.

# To normalise the data we use two different approaches, seurats default method (results stored in RNA), 
# and the scTransform method (results stored in SCT). 
lsks <- NormalizeData(lsks, verbose = FALSE)
lsks <- SCTransform(object = lsks, verbose = FALSE,variable.features.n = 2500, conserve.memory = F,return.only.var.genes = F)

sct.res <- checkNormalisation("SCT", lsks)
rna.res <- checkNormalisation("RNA", lsks)

#plot the effects of the normalisations
plot(rna.res$p1)

plot(sct.res$p2)

plot(rna.res$p2)

From our PCA plots we observe 3 main features: how much of the variance we retain following normalisation, the size of the dots which indicates the numbers of genes expressed per cell, and also the colour of the dots, which indicates the number of UMIs per cell. While the SCT normalisation retains less of the variance in principle component once we can see that it can better deal with eliminating technical noise compared to the default seurat approach. We can see that the effect of library size is not completed eliminated however we can expect some variation due to intrinsic biological variability. Later in the analysis when we visualise the data using PCA and non-linear equivalents we can see if there are any outlier cells that are due to technical artefacts and can revisit our QC and normalisation accordingly.

Step 3: Addressing the batch effect in our data

Plotting the average expression of all genes, shows that many genes are enriched in the GFP pos group but never the opposite. This is indicative of a batch effect that has arisen from samples being run separately on the 10X platform.

# Looking at our QC metrics, we see the datasets are fairly similar
VlnPlot(lsks, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3, group.by = "batch")

#However if you look more closely at the data you will see that there is a clear batch effect. 
#if you look at the data you will see that there is evidence of a batch effect in which many genes are upregulated in GFP
avg <- log1p(AverageExpression(lsks, add.ident = "batch")$SCT)
avg$ratio <- avg$barcode_GFPpos / avg$barcode_GFPneg
avg <- avg[avg$barcode_GFPpos > 0.5 &  avg$barcode_GFPneg > 0.5, ]
avg.ordered <- avg[order(avg$ratio,decreasing = T),]
p1 <- ggplot(avg, aes(barcode_GFPpos, barcode_GFPneg)) + geom_point() + ggtitle("barcode mouse")
p1 <- LabelPoints(plot = p1, points = rownames(avg.ordered)[1:10], repel = TRUE)

p1

To remove the batch effect in our data we use our batch variable in metadata stating whether a cell is GFP positive or negative and then regress this out using a negative binomial approach.

#run the regression
lsks <- ScaleData(object = lsks, vars.to.regress = "batch", model.use = "negbinom",
                  do.center = T, do.scale = T,assay = "SCT")

#visualise the impact of the batch transformation on our data
batch.genes <- rownames(avg.ordered)[1:5]

#Lets see the raw data for 5 genes that are affected by the batch effect
VlnPlot(lsks, features = batch.genes, group.by = "batch", slot = "counts",assay="SCT")

#Now lets look at the effect of normalisation on the expression of these genes
VlnPlot(lsks, features = batch.genes, group.by = "batch", slot = "data",assay="SCT")

#Finally we visualise how well our batch correction approach is working
VlnPlot(lsks, features = batch.genes, group.by = "batch", slot = "scale.data",assay="SCT")

Dimensionality reduction and data visualisation

We perform dimensionality reduction on 2500 variably expressed genes using both principle component analysis, an approach to find the linear combination of genes that are the greatest source of variance in the data.We visualize our data using the non-linear dimensionality reduction technique UMAP. This approach is analogous to PCA, but can also identify non-linear patterns in the data. The goal of the algorithm is to learn the underlying manifold of the data in order to place similar cells together in low-dimensional space. UMAP is preferable to t-SNE as it is faster to compute, and does not require tuning of the perplexity parameter.

lsks <- RunPCA(lsks, npcs = 50, verbose = FALSE)
ElbowPlot(lsks, ndims = 30)

lsks <- RunUMAP(lsks, reduction = "pca", dims = 1:15) 

#visualising the data we see we have an outlier population
DimPlot(lsks, reduction = "umap", group.by = "batch")

#Outlier detection

From our data visualisation we see a clear outlier population. Lets generate clusters to isolate this group and look at QC metrics and marker genes to see whether we should exclude this group from further analysis or not.

This analysis shows that the outlier group is infact B-cell progenitors, and not MPP4 cells so we exclude them from downstream analyses

lsks <- FindNeighbors(object = lsks, dims = 1:15, verbose = FALSE)
lsks <- FindClusters(lsks, dims.use = 1:15, print.output = FALSE,resolution=0.1,force.recalc= T)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 2807
## Number of edges: 92008
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9185
## Number of communities: 4
## Elapsed time: 0 seconds
DimPlot(object = lsks, label = F,reduction = "umap", pt.size = 2) 

#we find a clear outlier population, could this population be a technical artefact. 
VlnPlot(lsks, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)

#lets also look for markers of this cluster to give us an idea of what cell type it is
markers <- FindMarkers(lsks,ident.1 = "3",test.use = "LR")
head(markers)
##                p_val avg_logFC pct.1 pct.2    p_val_adj
## Vpreb3  4.283754e-78  3.049168     1 0.046 5.647272e-74
## Vpreb1  4.283754e-78  3.025766     1 0.101 5.647272e-74
## Cd79a   4.283754e-78  2.601769     1 0.035 5.647273e-74
## Ebf1    4.283754e-78  3.452091     1 0.051 5.647273e-74
## Igll1   7.253976e-77  3.060957     1 0.057 9.562917e-73
## Chchd10 4.980094e-75  2.544778     1 0.417 6.565258e-71
# based on the markers we conclude that these cells are in fact B cells, 
# and are not part of the progenitor compartment, 
# we checked this by looking at the MPP4 signature
lsks <- geneSetAnnotation(lsks, "MPP4_Pietras")
VlnPlot(lsks, "MPP4_Pietras1")

#remove the B cells from downstream analyses
non.bcells <- colnames(lsks)[lsks@meta.data$SCT_snn_res.0.1 != "3"]
lsks <- lsks[,non.bcells]

re-run the data visualisation on the filtered data. This time we use a lower number of PCs as input into the UMAP algorithm to better highlight the major aspects of variability within this cell population

#use low number of dimensions for visualisations
lsks <- RunUMAP(lsks, reduction = "pca", dims = 1:10,min.dist = 2, spread = 1)
DimPlot(lsks, reduction = "umap", group.by = "batch", pt.size = 2)

#Data clustering To look at key cell subsets within the progenitor compartment we perform clustering using the Louvain algorithm. As 8 clusters is the most robust result when varying the resolution parameter we choose this number of clusters for downstream analyses.

lsks <- FindNeighbors(object = lsks, dims = 1:10, verbose = FALSE,reduction = "pca")
for(i in seq(from=0.1, to=1, by=0.1)){
  lsks <- FindClusters(object = lsks, resolution = i, algorithm = 2, verbose = F)
}
clustree(lsks, prefix = "SCT_snn_res.",assay = "SCT") + NoLegend()

#lets choose a value of the resolution parameter that gives the most stable results
lsks <- FindClusters(lsks, dims.use = 1:10, print.output = FALSE,resolution=0.5,force.recalc= T,algorithm = 2, reduction = "pca")
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 2775
## Number of edges: 86849
## 
## Running Louvain algorithm with multilevel refinement...
## Maximum modularity in 10 random starts: 0.8233
## Number of communities: 8
## Elapsed time: 0 seconds
DimPlot(object = lsks, label = F,reduction = "umap", pt.size = 2) 

#lets find marker genes for our key clusters
cluster.markers <- FindAllMarkers(lsks, test.use = "LR")
x <-cluster.markers %>% group_by(cluster) %>% top_n(3, avg_logFC)
print(tbl_df(x), n=40)
## # A tibble: 24 x 7
##        p_val avg_logFC pct.1 pct.2  p_val_adj cluster gene         
##        <dbl>     <dbl> <dbl> <dbl>      <dbl> <fct>   <chr>        
##  1 5.67e-140     0.665 0.864 0.394  7.47e-136 0       Lig1         
##  2 8.12e-130     0.593 0.76  0.276  1.07e-125 0       Hells        
##  3 1.52e- 82     0.911 0.705 0.417  2.00e- 78 0       2810417H13Rik
##  4 7.26e- 55     0.419 0.935 0.777  9.57e- 51 1       Sox4         
##  5 4.76e- 43     0.749 0.621 0.387  6.28e- 39 1       Wfdc17       
##  6 7.61e- 28     0.579 0.366 0.188  1.00e- 23 1       Dntt         
##  7 3.86e- 82     0.850 0.898 0.626  5.09e- 78 2       Cd74         
##  8 1.16e- 64     0.597 0.858 0.616  1.53e- 60 2       Ly6a         
##  9 3.87e- 56     0.643 0.611 0.306  5.10e- 52 2       Ltb          
## 10 1.03e-165     1.08  0.952 0.381  1.36e-161 3       Pbx1         
## 11 7.90e-106     2.04  0.726 0.284  1.04e-101 3       Pf4          
## 12 1.06e- 30     1.02  0.823 0.793  1.39e- 26 3       Hist1h2ap    
## 13 1.13e-163     1.27  0.962 0.24   1.49e-159 4       Nusap1       
## 14 1.45e-145     1.37  0.979 0.35   1.92e-141 4       Ube2c        
## 15 5.70e- 87     1.33  0.945 0.782  7.52e- 83 4       Hist1h2ap    
## 16 5.33e- 92     1.39  0.988 0.693  7.03e- 88 5       Apoe         
## 17 7.58e- 44     0.811 0.783 0.364 10.00e- 40 5       F2r          
## 18 2.63e- 30     0.590 0.82  0.503  3.47e- 26 5       Gata2        
## 19 7.16e- 30     3.46  0.641 0.391  9.44e- 26 6       Ngp          
## 20 1.00e- 29     3.78  0.804 0.81   1.32e- 25 6       S100a8       
## 21 1.06e- 29     3.75  0.728 0.602  1.39e- 25 6       S100a9       
## 22 1.96e-132     3.00  0.952 0.166  2.58e-128 7       Car1         
## 23 1.79e- 96     2.21  0.976 0.37   2.36e- 92 7       Mt1          
## 24 5.36e- 88     2.40  0.774 0.095  7.06e- 84 7       Mt2
#calculate the expression of key gene-signatures so that we can annotate our clusters
gene.set.names <- c("WilsonMolO","MPP2_Pietras", "MPP3_Pietras" , "MPP4_Pietras")
lsks <- geneSetAnnotation(lsks, gene.set.names)

#see the expression of our signatures in our different clusters
VlnPlot(lsks,features = c("WilsonMolO1"),pt.size = 0) #cluster 2 is LTHSCs

VlnPlot(lsks,features = c("MPP2_Pietras1"),pt.size = 0) #clusters3 and 7 is LTHSCs

VlnPlot(lsks,features = c("MPP3_Pietras1"),pt.size = 0) #cluster 6 is LTHSCs

VlnPlot(lsks,features = c("MPP4_Pietras1"),pt.size = 0) #cluster 1 is LTHSCs

#lets rename our clusters
new.cluster.ids <- c("not_annotated", "MPP4", "LTHSC", "MPP2", "not_annotated", "not_annotated", 
    "MPP3", "MPP2")

names(new.cluster.ids) <- levels(lsks)
lsks <- RenameIdents(lsks, new.cluster.ids)
#as we see from this plot we get good agreement between the two approaches
DimPlot(lsks,cols = c("light grey", "cornflowerblue", "darkolivegreen3","orange", "plum"),pt.size = 1.5) + NoLegend()

features.to.plot <- c("Cd48", "Flt3", "Procr","WilsonMolO1",
                      "MPP2_Pietras1","MPP3_Pietras1","MPP4_Pietras1")

for(i in 1:length(features.to.plot)){
  VlnPlot(lsks, features = c(features.to.plot[i]),pt.size = 0, cols = c("light grey", "cornflowerblue", "darkolivegreen3","orange", "plum"))
}

DimPlot(lsks,cells.highlight = colnames(lsks)[lsks@meta.data$batch == "GFPpos"],
         cols.highlight = c( "green4"), pt.size = 1, sizes.highlight = 1.5) + NoLegend()

Cluster enrichment

#generate a barplot showing the distribution of cells within the clusters
counts <- table(Idents(lsks), lsks@meta.data$batch)
prop.df <- data.frame(cbind(data.frame(prop.table(counts[,1])),data.frame(prop.table(counts[,2]))))
colnames(prop.df) <- c("gfp.neg", "gfp.pos")


tiff("/users/jasoncosgrove/Desktop/Barplot.tiff", res = 300,width = 4, height = 4, units = 'in')
barplot(as.matrix(prop.df),col = c("light grey", "cornflowerblue", "darkolivegreen3","orange", "plum"), axes = F)
dev.off()
## quartz_off_screen 
##                 2
#From our barplot it looks like the GFPpos are enriched in the MPP3 cluster lets test this using a Fishers exact test
cluster_of_interest <- "MPP3"
lsks@meta.data$clusters <- lsks@active.ident
(input_for_Fisher <- matrix(c(ncol(lsks[, lsks@meta.data$clusters == cluster_of_interest &
                                          lsks@meta.data$batch == "GFPpos"]), 
                       ncol(lsks[, lsks@meta.data$clusters != cluster_of_interest &
                                          lsks@meta.data$batch == "GFPpos"]), 
                       ncol(lsks[, lsks@meta.data$clusters == cluster_of_interest &
                                          lsks@meta.data$batch == "GFPneg"]), 
                       ncol(lsks[, lsks@meta.data$clusters != cluster_of_interest &
                                          lsks@meta.data$batch == "GFPneg"])),
                     nrow = 2,
                     dimnames = list(cellType = c("Selected_cluster", "Others"),
                                     state = c("GFPpos", "GFPneg"))))
##                   state
## cellType           GFPpos GFPneg
##   Selected_cluster     46     46
##   Others              566   2117
fisher.test(input_for_Fisher, alternative = "two.sided")
## 
##  Fisher's Exact Test for Count Data
## 
## data:  input_for_Fisher
## p-value = 1.916e-09
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  2.402188 5.816742
## sample estimates:
## odds ratio 
##   3.737982

As we can see from this analysis the GFP postive cells are distributed throughout the LSK compartment in a similar way to non-GFP cells. However, the GFP positive cells are enriched within the MPP3 compartment as assessed using a Fishers exact test.

References

  1. Butler, Andrew et al. Integrating Single-Cell Transcriptomic Data across Different Conditions, Technologies, and Species. Nature Biotechnology (2018)

  2. Lun A, Risso D (2018). SingleCellExperiment: S4 Classes for Single Cell Data. R package version 1.4.0.

  3. AlJanahi, Aisha A., Mark Danielsen, and Cynthia E. Dunbar. ‘An Introduction to the Analysis of Single-Cell RNA-Sequencing Data’. Molecular Therapy - Methods & Clinical Development (2018)